ABSTRACT


Predicting novel or potential lead molecules for a therapeutic drug target without adverse effects is a challenging task in the
drug design, discovery, and development process. The systematic integration of multi-omics data from various data/knowledge
bases through computational techniques makes it possible to identify potential lead molecules and study their therapeutic properties.
Over the last decades, several drug discovery efforts using multi-omics and large-dataset integration methods have produced successful
results. In this paper, we present different types of computational approaches for predicting potential lead molecules through
the systems-level integration of multi-omics datasets.

INTRODUCTION

In drug discovery, a lead is a chemical compound that binds
to the active site regions of the biological target molecule
and hence minimizes the binding free energy. A lead may be a
natural product or a synthetic or semi-synthetic compound
that has therapeutic effects [1]. A natural product (or natural
drug) consists of bioactive compounds produced by living
organisms present in nature. Plants, minerals, and animals
(including microorganisms) are the common sources of natural
products [2,3,69,74,75]. Natural products can also be prepared
by chemical synthesis (both semi-synthesis and total synthesis)
and have played a major role in the development of potential
synthetic targets. Synthetic and semi-synthetic compounds, by
contrast, are chemically synthesized in the laboratory using
in silico and/or experimental approaches [4,5].

Developing a potential lead molecule by experimental methods
is a tedious, complicated, expensive, time-consuming,
trial-and-error process [6]. Recently, many advanced
computational techniques analogous to wet-lab techniques have
been introduced to reduce this problem. Modern computer-aided
drug design and discovery (CADDD) involves virtual screening,
testing, and validation of lead molecules in a short time span
using large datasets and software [73]. The resulting lead
molecule then undergoes a series of preclinical and clinical
studies to test for toxicity and adverse effects. The successful
drug candidate is released to the market in different dosage
forms after passing the Food and Drug Administration (FDA)
verification process [7,8].

MULTI-OMICS AND BIG DATA INTEGRATION 

Multi-omics is a new approach for analyzing biological
problems from various aspects by combining multiple omics
datasets [3,9]. The common types of omics include genomics,
proteomics, metabolomics, epigenomics, phytochemomics,
interactomics, and microbiomics [10-12]. Integrating multiple
omics data in a systematic way makes it possible to study
functional relationships or identify the key problem
efficiently. Associating large or complex multi-omics datasets
is a difficult task that requires sound knowledge of all areas
of omics. Pattern matching (or regular expressions) is a general
and widely used technique for extracting knowledge from the
datasets. Analyzing large multi-omics datasets involves big
data handling.
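
As a minimal sketch of pattern matching for knowledge extraction, the Python snippet below pulls target-ligand pairs out of annotation lines with a regular expression. The "target=...; ligand=..." record layout, the record IDs, and the helper name extract_pairs are assumptions made for illustration only; real databases use their own formats.

```python
import re

# Hypothetical annotation lines; this layout is an assumption for the
# example and does not correspond to any particular database format.
records = [
    "id=R1; target=EGFR; ligand=gefitinib",
    "id=R2; target=COX2; ligand=celecoxib",
    "malformed line without fields",
]

# Named groups capture the target and ligand fields of one record.
PAIR = re.compile(r"target=(?P<target>\w+);\s*ligand=(?P<ligand>\w+)")


def extract_pairs(lines):
    """Return (target, ligand) tuples for every line that matches PAIR."""
    pairs = []
    for line in lines:
        match = PAIR.search(line)
        if match:  # lines that do not match are skipped, not reported
            pairs.append((match.group("target"), match.group("ligand")))
    return pairs


pairs = extract_pairs(records)
```

Lines that do not fit the pattern are silently skipped here; a production pipeline would more likely log them for review.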

Due to the rapid growth in the size, diversity, and complexity
of datasets in biological databases, big data techniques were
introduced to analyze, manage, and derive knowledge from the
datasets. Big data (also known as huge data or massive data)
refers to a very large volume of data or data storage that
cannot be processed using traditional computing devices and
applications. The size of big data ranges from petabytes
(1 PB = 10^15 bytes) to exabytes (1 EB = 10^18 bytes), or even
more [13-15]. Even though big data analysis is a hot topic
today, the concept has evolved over many years in the IT and
R&D sectors. Next-generation sequencing (NGS) and drug
discovery are the two most popular areas of the biological
sciences that currently apply big data analysis to knowledge
discovery [16-18].

Comprehensive Data Integration Methods 

Integrating comprehensive and related datasets from various
biological databases or other external sources increases the
accuracy of lead prediction and also reveals hidden functions
and interrelationships within the molecules [19]. Three types
of approaches are adopted to combine comprehensive data and
reduce data size (Table 1): (i) semantic web approach -
searching, retrieving, or annotating data from other external
data sources through metadata or RESTful API web services
[20,21]; (ii) data warehousing approach - extracting data from
other external sources and combining them into a global
dataset [19,22]; and (iii) data mining approach - extracting
data or knowledge from different types of large datasets
through suitable patterns [23,24].
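
The data warehousing approach (ii) can be sketched in a few lines of Python: records extracted from separate sources are combined into one global table keyed by a shared identifier. The two in-memory sources, the compound IDs, the field names, and the function build_warehouse are all hypothetical; a real warehouse would load these records from external databases.

```python
# Two hypothetical sources keyed by a shared compound identifier.
structure_source = {
    "C001": {"name": "compound A", "formula": "C9H8O4"},
    "C002": {"name": "compound B", "formula": "C8H9NO2"},
}
activity_source = {
    "C001": {"target": "T1"},
    "C003": {"target": "T2"},
}


def build_warehouse(*sources):
    """Merge per-source records into one global table keyed by compound ID.

    If two sources define the same field for the same ID, the value from
    the later source wins.
    """
    warehouse = {}
    for source in sources:
        for compound_id, fields in source.items():
            warehouse.setdefault(compound_id, {}).update(fields)
    return warehouse


warehouse = build_warehouse(structure_source, activity_source)
```

Compounds present in only one source are kept with the fields that are available, which makes gaps in coverage visible in the global dataset.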

Most of the popular three-dimensional (3D) molecular structure
databases, such as RCSB Protein Data Bank [25], NCBI
PubChem [26], EMBL-EBI ChEBI [27], and DrugBank [28], have
implemented RESTful API web services or SOAP to share or
integrate data in the form of FTP, HTML, XML, JSON, plain
text, or AWK commands [29]. Moreover, cloud computing services
are offered to handle, analyze, or interpret big datasets
through various remote applications/servers. Many cloud-based
tools, such as CloudBLAST [30], Myrna [31], CloudBurst [32],
Hadoop-BAM [33], GPU-BLAST [34], Hydra [35], PeakRanger [36],
and Crossbow [37], are available for analyzing different types
of big datasets [38-41].

Unsupervised Data Analysis and Analytics 

Handling big datasets or multi-omics data is a difficult task
because the data are often very comprehensive and available in
real time. In bioinformatics, sequence (alphabets) and
structure (XYZ coordinates) are the major data used for big
data analysis and analytics. Effective lead identification and
functional interrelationship prediction require the
integration of very large datasets of 3D chemical libraries
and disease-target-ligand interaction networks. Unsupervised
multi-omics/big datasets are usually integrated using
clustering and grouping techniques. The different types of
dataset integration include target-ligand interactions,
intermolecular interactions, disease-target interactions,
disease-disease relationships, protein-protein interactions,
target-disease-metabolic pathways, drug-side effect
relationships, gene interactions, structure-function
relationships, etc. [42-44].
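
To illustrate the clustering and grouping step, the sketch below groups compounds by a simple distance threshold (single-linkage grouping implemented with a union-find structure). The technique is a stand-in chosen for brevity, not a method named in the text, and the one-dimensional descriptor values, ligand names, and threshold are hypothetical.

```python
def cluster_by_threshold(items, distance, threshold):
    """Single-linkage grouping: items closer than `threshold` share a cluster."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Join every pair of items whose distance is below the threshold.
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if distance(a, b) < threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical one-dimensional descriptor per ligand (illustrative values).
descriptor = {"L1": 0.10, "L2": 0.15, "L3": 0.90, "L4": 0.95, "L5": 0.50}
names = list(descriptor)
clusters = cluster_by_threshold(
    names, lambda a, b: abs(descriptor[a] - descriptor[b]), threshold=0.2
)
```

The pairwise loop is quadratic in the number of items, so this form only suits small examples; large chemical libraries would need an indexed or approximate neighbor search.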

Biological data interrelations can be represented graphically
as network models, and the main types of unsupervised dataset
integration methods are [44,56]: (i) network-based methods -
graphical representation of interrelations using the network
(distance) datasets [45,46], (ii) Bayesian methods -
probabilistic graphical representation of interrelations using
the probability distribution datasets [47-51], (iii)
correlation-based methods - multivariate graphical
representation of interrelations using the partial least
squares datasets [52,53], (iv) matrix factorization methods -
graphical representation of interrelations using the product
and rank of the two matrix datasets [54], and (v) kernel-based
methods - graphical representation of interrelations using the
pattern datasets predicted from the kernel matrix [55].
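
As a simplified illustration of the correlation-based idea in (iii), the sketch below computes a Pearson correlation coefficient between two omics profiles measured over the same samples. Pearson correlation is used here as a simpler stand-in for partial least squares, and the profile values are made up for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / (spread_x * spread_y)


# Hypothetical measurements of one gene product across four samples.
transcript = [1.0, 2.0, 3.0, 4.0]
protein = [2.1, 3.9, 6.2, 7.8]
metabolite = [4.0, 3.0, 2.0, 1.0]

r_protein = pearson(transcript, protein)        # strongly positive
r_metabolite = pearson(transcript, metabolite)  # perfectly negative
```

Pairs of features whose coefficient exceeds a chosen cutoff could then be drawn as edges in an interrelation network.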

Big Data Accessing Methods 

Accessing large datasets requires high-performance computing
(HPC) infrastructure and a suitable big data framework
[14,15]. The common methods for big data handling are cloud
computing, graphics processing unit (GPU) computing, Xeon Phi
computing, grid computing, and cluster computing [57,58].
Large datasets can be accessed from various data sources using
a big data framework, which is based on client-server
technology [59]. Many types of big data processing frameworks
are used for accessing datasets through a pipeline; the most
widely used frameworks and programs are Apache Hadoop [76],
Apache Spark [77], Apache Flink [78], Apache Storm [79],
Apache Samza [80], Apache Cassandra [81], NoSQL [82], R [83],
and Python [84].
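
The pipeline style of processing used by such frameworks can be sketched in plain Python, in the spirit of the map-reduce model that Hadoop implements: the data are split into chunks, each chunk is summarized independently, and the partial results are merged. No framework API is used here; the chunk size, the (target, ligand) rows, and the function names are assumptions for illustration.

```python
from collections import Counter
from functools import reduce


def chunked(rows, size):
    """Yield consecutive slices of `rows` with at most `size` entries each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def map_chunk(chunk):
    """Map step: count how often each target occurs within one chunk."""
    return Counter(target for target, _ligand in chunk)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical (target, ligand) interaction rows.
rows = [("T1", "L1"), ("T2", "L2"), ("T1", "L3"), ("T3", "L4"), ("T1", "L5")]

partials = [map_chunk(chunk) for chunk in chunked(rows, 2)]
totals = reduce(reduce_counts, partials, Counter())
```

Because each chunk is summarized independently, the map step could in principle be distributed across workers, which is the property a cluster framework exploits.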

SYSTEMATIC MULTI-OMICS DATA INTEGRATION

Successful drug discovery requires the exact or most suitable
compound, one that fits all pockets in the active site of the
target molecule and brings it to a stable state [7,8]. The
systematic integration of theoretical and experimental
datasets of multi-omics, target-ligand interaction networks,
physicochemical properties, and functional properties supports
the design of safe and efficient therapeutics [60].

Integrative Systems Biology Approach 

To design an effective drug molecule, it is essential to
understand the nature and causes of the disease [61].
Integrative systems biology advances the thorough, systematic
study of the biological phenomena of a system (an organism,
e.g. human) (Figure 1). The complex interaction networks in a
system can be combined through either top-down or bottom-up
approaches using multi-omics datasets [62,63]. Many
bioinformatics databases and tools are currently available for
collecting various omics data, which can then be used to
design a new virtual system.

Computational Methods for Lead Identification 

A lead molecule can be identified by integrating or comparing
target data with large datasets using computational and
statistical approaches. The common computational lead
identification techniques using large datasets include:

i. Multiple sequence alignment - It is a popular method to
find local similarity, homology, and phylogenetic
relationships between different gene or protein sequences
[41]. Sequence similarity obtained through structure-based
sequence alignment makes it possible to find similar
target-ligand interacting molecules. Structural superposition
is an alternative approach that compares similar protein
structures based on the root mean square deviation (RMSD)
calculation [64]. Moreover, systematic integration of large
datasets of target-ligand molecular interaction network data
with multi-omics data makes it possible to predict or design
a potential lead molecule [60].

ii. Maximum common substructure - It is a widely used method
in CADD for finding similar 3D structures through
structure-based or ligand-based virtual screening [60].
Maximum common substructure search using SMILES (Simplified
Molecular Input Line Entry System) patterns is commonly used
to find structural similarity between large chemical datasets
[65]. A substructure search with compounds in the
phenotype-linked target-ligand interaction network datasets,
integrated with multi-omics data, makes it possible to predict
or design a novel and potential lead molecule [66-68].

iii. Molecular interaction network - It is the modern and most
successful approach to find a novel drug by systematic
integration of large datasets of multi-omics data [60]. Data
scientists integrate big data into a complex network in the
order phenotype → target → target-ligand ← ligand ← chemical
library to predict or design a novel and potential lead
molecule (Figure 2). Recently, many big pharmaceutical
companies and R&D organizations have renewed their interest in
discovering potential lead compounds from natural products
because of their structural diversity and medicinal properties
[3,69,70].
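
For the structural superposition comparison mentioned under item i, the RMSD between two structures with N paired atoms is the square root of the mean squared distance between corresponding atoms. The sketch below computes it in Python under two stated assumptions: the structures are already superposed, and the atoms are listed in matching order. It does not perform the optimal alignment step, and the coordinates are illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two paired lists of XYZ coordinates.

    Assumes both structures are already superposed and that atom i in
    `coords_a` corresponds to atom i in `coords_b`.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Illustrative coordinates: the second structure is the first one
# shifted by 1.0 along the x axis.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in reference]

deviation = rmsd(reference, shifted)
```

A lower RMSD indicates more similar structures; identical coordinate sets give a value of zero.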

CONCLUSION 

Biological systems are analogous to computer systems in
disease/target identification and drug design. To troubleshoot
hardware issues in a computer, we must have the complete
circuit diagram and the component needed to fix the problem
[71]. Similarly, by increasing the volume of multi-omics
datasets and systematically integrating large datasets, it is
possible to design an effective drug molecule [72]. Recent
research advances in cloud computing, big data analysis,
multi-omics data integration, and virtual screening and
testing technology have reduced the cost and time of
predicting potential lead molecules.

Figure 1: An illustration of multi-omics data integration through the integrative systems biology approach.



Figure 2: An illustration of a multiple target and phytochemical interaction network.